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1 Power laws in genomic quantities 

A few years ago it was noticed that the distributions of gene-family sizes in fully-sequenced genomes 
follow power-law distributions (8||6l. Since then different authors have shown that there is in fact a large 
array of genomic features that show power law distributions. Almost all of these concern the distributions 
of genomic features within a single genome. For instance, it is shown in [ 13 1 that the number of genomic 
occurrences of DNA words, protein folds, superfamilies, and families all follow power-law distributions. 
Power-law distributions are also found in the structure of the 'protein universe' fT2l : the number of 
protein families per fold is power-law distributed, and so is the number of different assigned biological 
functions per fold [13|. Power-laws also appear in the structure of cellular interaction and regulatory 
networks. For example, the number of genes that a given gene interacts with is power-law distributed. 
This holds both when one defines 'interaction' between genes on the level of the proteins that they 
encode 071 191 1141 or if one defines it at the level of co-regulation of the expression of the genes fl6l . 
The experimental data on transcription regulatory networks is rather incomplete but they also suggest 
that the number of genes regulated per transcription factor might have power-law tails 0IE1- Finally, 
power-laws also appear in cellular metabolic networks; the number of substrates that any given substrate 
interacts with is power-law distributed I10IIT91 . 

2 Comparing genomic features across genomes 

Note that almost all the power-law distributions just mentioned refer to statistics that are taken over a 
single genome or cellular network. The statistics of genomic features across genomes has been much less 
(if at all) investigated. To a large extent this may be because until recently there simply weren't enough 
fully-sequenced genomes to obtain meaningful statistics across genomes. However, this situation is 
changing rapidly. 

There are currently about 150 fully-sequenced microbial genomes in genbank and this number ap- 
pears to grow exponentially as I have shown in fT8l . Figure ^ shows an updated plot of the number of 
fully-sequenced microbial genomes as a function of time (see Methods section). The current number of 
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Figure 1: The number of fully-sequenced microbial genomes submitted to the genbank database as a 
function of time (in years). The vertical axis is shown on a logarithmic scale. The black line is the 
least-square fit to an exponential, n = 2(* -1993 - 4 )/ 1 38 . 



available microbial genomes is only large enough to allow for meaningful cross-genome comparisons of 
the most basic statistics of gene-content and organization and this is what I will focus on in this chapter. 
However, the exponential fit in figure [^predicts that the number of sequenced genomes doubles roughly 
every 17 months. This implies that by 2010 we may have as many as 3000 fully-sequenced microbial 
genomes available. It is therefore clear that much more detailed comparative genomic analyses than the 
ones presented in this chapter will become possible over the next decade. 

In fT8l I compared the number of genes in high-level functional categories across all sequenced 
genomes and showed that they follow power-laws as a function of the total number of genes in the 
genome. In this chapter I will recapitulate these results and augment them in several ways. In particular, 
I have extended the analysis to all functional categories that are represented by at least 1 gene in each bac- 
terial genome and have recalculated the observed exponents based on the latest genomic data. Second, I 
will go into more detail regarding the implications of the observed scaling laws for the general organiza- 
tion of gene-content across genomes and discuss an evolutionary model that relates the observed scaling 
behavior in gene content to fundamental constants of the evolutionary process. Finally, I will discuss 
theoretical explanations for the category of transcription regulatory genes which shows approximately 
quadratic scaling with the total number of genes in the genome. 



3 Scaling in functional gene-content statistics 

To count and compare the number of genes in different functional categories for all sequenced genomes 
one needs to first define a set of functional categories and then annotate all genomes in terms of these 
functional categories. I used the biological process hierarchy of the Gene Ontology [4] to define func- 
tional categories, and Interpro annotations of fully-sequenced genomes to associate genes with GO cate- 
gories. The details of the annotation procedure are described in the Methods section. The result is a count 
of the number of genes associated with each of the GO categories in the biological process hierarchy for 
each of the sequenced genomes. 

The set of genomes in this study consists of 116 bacteria, 15 Archaea, and 10 Eukaryota. In this 



chapter I will focus solely on the bacterial data since this is the only kingdom for which there is sufficient 
data to obtain meaningful statistics. The reader is referred to [ 18 1 for a discussion of the observed scaling 
laws in Archaea and Eukaryota. 

There are 154 GO categories in the biological process hierarchy that have at least 1 associated gene 
in each of the 116 bacterial genomes. I will refer to these categories as the 'ubiquitous' categories. The 
results for a selection of 6 of these ubiquitous GO categories are shown in figure [2] The figure shows 
the dependence of the number of genes in each category on the total number of genes with annotation in 
the genome. The left panel shows the categories "signal transduction" (red), "carbohydrate metabolism" 




Figure 2: The number of genes involved in (left panel) signal transduction (red), carbohydrate 
metabolism (blue), and DNA repair (green), and (right panel) any biological process (blue), transcrip- 
tion regulation (red), and protein biosynthesis (green) as a function of the total number of genes in the 
genome that have any functional annotation at all. Each colored dot represents the counts for a sin- 
gle genome. Both axis are shown on a logarithmic scale. The straight lines show power-law fits: (left 
panel) n = 0.000015# L95 (red), n = 0.0633 ' 96 (blue), n = 0.37c/ 605 (green), and (right panel) 
n = 0.0000955 1 - 83 (red), n = 0.92 5 - 95 (blue), and n = 30.8c/ 016 (green). 

(blue), and "DNA repair" (green), while the right panel shows the categories "transcription regulation" 
(red), "biological process" (blue), and "protein biosynthesis" (green). Each dot represents the counts 
in a single bacterial genome and both axes in Fig. |2] are shown on a logarithmic scale. Note that in 
[ 1 8 ] a similar plot was shown but with the horizontal axis representing the total number of genes rather 
than the total number of annotated genes. Since the total number of annotated genes is to a very good 
approximation proportional the total number of genes, the results are virtually identical whether one uses 
the total number of genes or the total number of annotated genes. I decided to use the total number of 
annotated genes on the horizontal axis in figure|2lpartly to illustrate this fact. In addition, the genome size 
in bacteria is also to a very good approximation proportional to the total number of genes in the genome. 
Thus, if we had used the genome size instead of the number of annotated genes on the horizontal axis 
Fig. |2]would again have looked virtually identical. 

The dots of each color in Fig. [2] fall approximately on a straight line. Thus, the logarithms of the 
number of genes n c in a category c and the total number of genes g (or the number of annotated genes or 
the genome size) are approximately linearly related: 

log(n c ) = cx c log(sO + b c . (1) 
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In other words, the number of genes n c in a category increases as a power-law in the total number of 
genes g: 

n c = X c g ac . (2) 

For the 6 functional categories shown, the exponents a c of the best power-law fits are indicated in 
the figure caption. The fits were obtained using the procedure described in the Methods section. The 
exponents range from a = 0.16 for protein biosynthesis to a = 1.95 for signal transduction. 

To further show the range and variation of the observed exponents Fig. [3] shows the inferred ex- 
ponents and their 99% posterior probability intervals for a selection of 20 functional categories. The 
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Figure 3: The 99% posterior probability intervals for the scaling exponents of a selection of functional 
categories. The functional categories are indicated on the left of each bar and the red section of the bar 
indicates the 99% posterior probability interval for the exponent of that category. The categories are 
ordered from top to bottom by increasing lower-bound of the posterior probability interval. 

exponents range from close to zero to roughly 2. Note that, for a category c with exponent a c the relative 
proportion p c of genes in the genome scales as p c = \ c g ac ~ l . That is, when a c < 1 the proportion of 
genes in the category will decrease with genome size, while for a c > 1 the proportion of genes in the 
category will increase with genome size. Thus, for a category c with exponent close to 2, the proportion 
p c will increase almost linearly with genome size. The behavior of different categories thus ranges from 
categories where the number of genes is almost constant with genome size (a c ~ 0), to categories where 
the proportion of genes in the genome increases linearly with genome size (q c ks 0). 

The general picture that emerges from Fig. [3] is that the proportion of genes in essential low-level 
functional categories such as protein biosynthesis and DNA replication decreases with genome size, 
whereas the proportion of genes that play regulatory roles such as genes involved in signal transduc- 
tion and transcription regulation increases approximately linearly with genome size. In between these 
extremes is a number of categories, including different metabolic functions, for which the exponent is 
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roughly 1, indicating that the genomic percentage of genes in these categories is roughly independent of 
genome size. 



3.1 Upper bound on genome size 

The observed quadratic scaling of the number of regulatory genes with the total number of genes in 
the genome obviously cannot extend to arbitrarily large genome sizes. If we extend the red curves, 
corresponding to "transcription regulation" and "signal transduction", in figure |2] to the right, we will 
eventually reach a point where the number of signal transducers and transcription regulators would be 
larger than the total number of genes in the genome and this is obviously impossible. Thus, if all bacterial 
genomes obey the relations indicated in our figure, there must be an upper bound on bacterial genome 
size. A naive upper bound is obtained by demanding that the number of genes in any category is less than 
the total number of genes in the genome, i.e. n c = X c g ac < g. If one substitutes the values of the fits 
for transcription regulation into this equation one obtains an upper bound of approximately g < 70000 
genes. A tighter upper bound is obtained when one demands that n c cannot increase by more than 1 gene 
when g is increased by 1 gene, i.e. X c (g + l) Qc < X c g° c + 1 . One then obtains an upper bound of 
g < 34000 when the values for transcription regulation are substituted. In [ 5 1 an upper bound is derived 
by assuming that the number of genes involved in transcription regulation has to be less than half of the 
genome size, i.e. n c < g/2. With our fit this leads to an upper bound of about g < 30000. 

It is clear that all these upper bounds substantially overestimate the approximately 10000 genes of 
the largest observed bacterial genomes and that a more realistic theory is needed to plausibly explain the 
apparent size-constrained on bacterial genomes. In this regard it is also interesting to note that in all the 
upper bounds just proposed, the proportion of genes that are transcription regulators is at least 50% at 
the maximal genome size, whereas the percentage is at most 11% in the currently sequenced bacterial 
genomes. 



3.2 Consequences for the topology of the transcription regulatory network 

The approximately quadratic scaling of the number of transcription regulatory genes also has some in- 
teresting consequences for the structure of the transcription regulatory network as a function of genome 
size. The class of transcription regulatory genes consists for the most part of DNA-binding transcription 
factors that regulate transcription through the binding to specific regulatory motifs in intergenic regions. 
We can imagine the transcription regulatory network by a set of arrows pointing from each of the regu- 
lators to each of the genes that they regulate. The total number of arrows in this network for a genome 
of size g is the product of the number of genes g times the average number of incoming arrows per gene 
(r(g)), i.e. (r(g)) represents the average number of different regulators regulating each gene in a genome 
of size g. Note that we can also write the total number of arrows as the number of transcription factors 
ntx(g) in a genome of size g times the average number of genes (n(g)) that each regulator regulates. We 
thus have 

ntr(g)(n(g)) = (r(g))g & Pf^r oc g, (3) 

'Note that in principle nothing keeps a genome from increasing n c by more than 1 gene when g is increased by 1 gene. 
That is, some genes in other categories than c may be removed and replaced by genes in category c. 
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where the equation on the right follows from the quadratic scaling of the number of regulators: n tr (g) oc 
g 2 . To elucidate what the equation on the right implies, let us consider what follows if we assume that 
either (n(g)) = constant or (r(g)) = constant. In the first case the average number of genes regulated 
per transcription factor, i.e. the regulon size, is independent of genome size. In that case the average 
number of different transcription factors (r(g)) regulating each gene must be increasing linearly with 
genome size, i.e. (r(g)) oc g If on the other hand (r(g)) were constant with genome size, then the 
average regulon size (n(g)} should be decreasing with genome size, i.e. (n(g)) oc 1/g. Between these 
two extremes there is a continuum of solutions where {r(g)} increases more slowly, and (n(g)) decreases 
more slowly, but such that still (r(g)) / (n(g)) oc g. 

There is currently very little data to decide if real regulatory networks are closer to the limit where 
(r(g)} increases linearly, or closer to the limit where (n(g)) oc 1/g. One piece of indirect evidence is 
the dependence of the number of operons and the amount of intergenic region on genome size. If the 
average number of transcription factors (r(g)) regulating each gene were to increase with genome size, 
then one might expect that, as genome size increases, the average operon size should decrease and that 
the amount of intergenic region per gene should increase. It is of course a nontrivial task to identify 
the number of operons from genome sequence alone. However, as a proxy we may consider runs of 
consecutive genes that are located on the same strand of the DNA (see for a method of estimating 
operon number using this statistic). Since all genes in an operon necessarily have to be transcribed in 
the same direction, a decrease in operon number would likely be reflected by a decrease in the average 
length of runs of consecutive genes on the same strand. Figure 0] (left panel) shows this average length 
of iso-strand genes as a function of genome size for all bacterial genomes that are currently in the NCBI 
database of fully-sequenced genomes. The figure suggests a slight decrease of the length of these runs, 
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Figure 4: The average length of runs of genes that are transcribed from the same strand (left panel) and 
the average number of intergenic bases per gene (right panel) as a function of the total number of genes 
in the genome. Each dot corresponds to a bacterial genome in the NCBI database 

consistent with what was reported in [3 ], but there is a large amount of variation and the trend is far from 
convincing, i.e. r 2 = 0.23 under simple regression. Note also that the drop in operon size is at most 
a factor of two between the largest and smallest genomes, while total gene number increases almost a 
factor of 20. The right panel shows the average number of intergenic bases per gene as a function of the 
total number of genes in the genome. In this case a correlation between genome size and the amount of 
intergenic region is completely absent, i.e. r 2 = 0.0005. Thus these two statistics provide little evidence 
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that (r(g)) increases substantially with genome size. However, one cannot exclude the possibility that 
in small genomes a large proportion of genes is not transcriptionally regulated at all, and that as genome 
size increases this proportion drops dramatically. This would still lead to a substantial increase of (r(g)} 
with genome size. It does seem plausible, however, that larger genomes may have a larger number of 
'specialized' regulons that typically regulate a smaller number of genes compared to the more general 
regulators that one expects to find in all organisms, and that (n(g)} thus decreases with genome size. 

3.3 Quality of the fits 

The power-laws observed in figure|2]are observed for the large majority of the 154 ubiquitous functional 
categories. I assessed the quality of the power-law fits by a measure F that measures the fraction of the 
variance in the data that is explained by the power-law fit (see Methods). Figure [5] shows the cumulative 
distribution of F for all 154 ubiquitous functional categories. As can be seen from Fig. |5] about two 
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Figure 5: Cumulative distribution of the quality of the power-law fits for the 154 ubiquitous functional 
categories. The horizontal axis shows the fraction F of the variance in the data explained by the fit, 
and the vertical axis shows the percentage of categories that have at least a fraction F of their variance 
explained by the fit. 

thirds of the categories have more than 90% of the variance explained by the fit. More than 95% of the 
categories have more than 80% of the variance explained by the fit. We thus see that most of ubiquitous 
functional categories follow scaling laws like the ones shown in Fig. |2] 

4 Principle Component Analysis 

Instead of considering the scaling behavior of one functional category at a time, we can consider all 
functional categories at once. We can consider a 154-dimensional 'function space' in which each axis 
represents an ubiquitous functional category, and represent the 'functional gene content' of a genome by 
a point in this 154-dimensional space. That is, if we number the categories 1 through 154, and n c is the 
number of genes in category c, then n = (m, ri2, ■ ■ ■ , ni.54) represents the functional gene content of the 
genome as a point in the function space. The set of all sequenced genomes thus forms a scatter in this 
function space. 
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We can now ask what the shape is of this cloud in function space. To do this, it is convenient to again 
consider all axes on logarithmic scales. That is, we consider the scatter of points x with x c = log(ra c ). 
The results in the previous section showed that, to a good approximation, almost all function categories 
obey the linear equations 

x c = u c log(flO + 0c, (4) 

where g is the total number of genes in the genome. If this equation were to hold exactly for all categories, 
then the numbers x c and xg, for any two categories c and c, would also be linearly related: 

X c = —X c + Pc ~ —Pc- (5) 

a- c a- c 

Thus, the statement that the equation © holds for all categories is equivalent to the statement that the 
scatter of points in function space fall on a straight line. Of course, since equation © holds only approx- 
imately for each category, the scatter of points falls only approximately on a line. To illustrate this figure 
|6] shows three projections of the scatter of points onto three-dimensional subspaces each representing 
three functional categories. That is, each of these three figures shows the scatter of points with respect to 




Figure 6: Three 'slices' through the scatter of genomes in function space. Each point corresponds to a 
genome, and its components along the three axes correspond to the logarithms of the gene numbers in 
each of three categories that the axes represent. The scatter is shown with respect to the categories 'carbo- 
hydrate metabolism', 'transport', and 'RNA processing' (left panel); 'protein catabolism', 'DNA repair', 
and 'signal transduction' (middle panel); and 'cell cycle', 'electron transport', and 'amine metabolism' 
(right panel). The first three principle component axes are shown as the red, blue, and green lines in the 
figures (see below). 

only three of the 154 axes. The functional categories that were used for the projections are indicated in 
each of the plots and in the caption. As can be seen, the scatter of points indeed approximately falls on a 
straight line in each of these projections. 

The extent to which the scatter of points falls on a line can be quantified most directly using principle 
component analysis [11]. In principle component analysis one aims to represent a scatter of points in a 
high-dimensional space by a scatter in a much lower dimensional space. To this end it finds an ordered set 
of orthogonal coordinate axes in the n-dimensional space that have the property that, for any m < n, the 
sum of the squared distances of the data points to their projections on the first m axes of the coordinate 
system is minimized (i.e. no set of m axes has a lower sum of squared distances). That is, the first 
principle axis is chosen such that minimizes the average square-distance of the data points this axis is 
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minimized. The second principle axis is chosen such that the average square-distance of the data points 
to the surface spanned by the first and second principle axis is minimized. And so on for the further 
principle components. 

The ability of this coordinate system to represent the scatter of points is again measured by the 
fraction of the variance in the data that is captured by consecutive axes. For the scatter of 116 bacterial 
genomes in the function space of 154 ubiquitous categories, the first principle axis captures 22% of all the 
variance in the data. The second principle axis captures 6% of the variance, the third 5% of the variance, 
etcetera. Thus, the amount of variance explained drops sharply between the first and second principle 
axis. After that, the amount of variance explained by the data decreases smoothly, dropping between 5 
and 10 percent between consecutive axes. As many as 55 axes are needed to cover 95% of the variance 
in the data. These statistics suggest that only the first principle axis captures an essential characteristic 
of gene-content in bacterial genomes. Once this largest component of the variance is taken into account, 
one needs almost as many dimensions as there are genomes to explain the remaining variance in gene- 
content. 

Table|4]summarizes the statistics of the first three principle axes. Each of the axes is a vector a whose 
direction in function space is reflected by the relative sizes of the components a c . In the leftmost column 
of table|4]l have listed, for each axis, the 4 categories with the highest components a c and the 4 categories 
with the lowest components a c (redundant categories were omitted). The values of these components a c 
are shown in the second column. The projection of all genomes onto one of the principle axes gives 
another vector v where v g is the component of genome g in the direction of the principle axis. The third 
column in table |4]shows, for each principle axis, the 4 genomes with the highest components v g and the 

4 genomes with the lowest components v g . The components themselves are shown in column 4. These 
first three principle axes are also indicated as the red, blue, and green lines in figure |6] 

As can be easily derived from equation ©, the components a c of the first principle axis correspond 
precisely to the inferred exponents of figure|3] Thus, the entire collection of scaling laws is summarized 
in this single vector a. In summary, a large fraction of the variation in functional gene-content among 
all bacterial genomes can be summarized into a single vector which encodes how the numbers of genes 
in different functional categories increases and decreases as the total size of the genome varies. That is, 
as the genome size increases or decreases the numbers of genes in each functional category c increase 
or decrease as g ac . The vector a thus reflects a basic functional architecture of gene-content that holds 
across all bacterial genomes. 

The meaning of the second and third principle axis is less clear, and given that they only capture a 
relatively small amount of the variance it is not clear that they very meaningful at all. For both these 
axes the genomes at the extremes tend to be small parasitic organisms. This suggests that these axes may 
reflect the different types of parasitic lifestyles. 

5 Evolutionary interpretation 

What is the origin of the scaling laws discussed in the previous sections? In this section I will show 
that the observed scaling laws in fact suggest that there are fundamental constants in the evolutionary 
dynamics of genomes. 

Consider a particular genome with numbers of genes n c in different functional categories c. Consider 
next the evolutionary history of this genome. With this I mean that the current genome can be followed 
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First principle axis 


top and bottom categories. 


comp. 


top and bottom genomes 


comp. 


cell communication 


1.91 


Streptomyces coelicolor 


0.15 


signal transduction 


1.91 


Streptomyces avermitilis 


0.14 


regulation of transcription 


1.88 


Bradyrhizobium japonicum 


0.14 


electron transport 


1.46 


Rhizobium loti 


0.14 


hydrogen transport 


0.25 


Buchnera aphidicola 


-0.19 


protein biosynthesis 


0.17 


Mycoplasma pneumoniae 


-0.22 


ATP metabolism 


0.10 


Ureaplasma parvum 


-0.24 


rRNA modification 


0.06 


Mycoplasma genitalium 


-0.24 


Second principle axis 


porphyrin metabolism 


0.28 


Mycobacterium leprae 


0.16 


fatty acid metabolism 


0.20 


Prochlorococcus marinus 


0.16 


carboxylic acid biosynthesis 


0.19 


Wigglesworthia glossinidia brevipalpis 


0.16 


heterocycle metabolism 


0.14 


Candidatus Blochmannia floridanus 


0.15 


DNA alkylation 


-0.16 


Mycoplasma penetrans 


-0.22 


epigenetic regulation of gene expression 


-0.17 


Borrelia burgdorferi 


-0.22 


nucleoside metabolism 


-0.17 


Ureaplasma parvum 


-0.22 


aspartyl-tRNA aminoacylation 


-0.17 


Mycoplasma pulmonis 


-0.3 


Third principle axis 


protein secretion 


0.31 


Borrelia burgdorferi 


0.46 


cell communication 


0.15 


Chlamydia trachomatis 


0.26 


signal transduction 


0.13 


Treponema pallidum 


0.23 


DNA dependent DNA replication 


0.09 


Chlamydia muridarum 


0.23 


amino acid biosynthesis 


-0.16 


Streptococcus pneumoniae 


-0.13 


ribonucleotide metabolism 


-0.17 


Prochlorococcus marinus 


0.14 


purine nucleotide metabolism 


-0.18 


Wigglesworthia glossinidia brevipalpis 


-0.15 


ATP metabolism 


-0.18 


Candidatus Blochmannia floridanus 


-0.19 



Table 1: Summary of the first three principle axes. Each principle axis is a vector a in function space 
with component a c the component of the vector in direction of category c. For each principle axis the 4 
categories with the highest, and 4 categories with the lowest a c are shown (omitting redundant categories) 
in the leftmost column. The second column shows the a c . The location of each genome in function space 
can be expressed as a linear combination of principle axes. For each principle axes shown above the 4 
genomes with the highest and 4 genomes with the lowest components along the axis are shown in column 
3, and the values of the components of these genomes are shown in column 4. 
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back in time through the life of the cell the genome was taken from, through the cell division that 
produced it, through the life of its ancestral cell, and its ancestors, and so on. In this way the history 
of the genome can be traced back all the way to an ancestral prokaryotic cell from which all currently 
existing bacteria stem. During this evolutionary history the numbers n c have of coursed increased and 
decreased in ways that are unknown to us. That is, there is (unknown) functions of time n c (t) that 
describe the evolution of the numbers of genes in each category c in the genome under study. Similarly, 
there will be a function g(t) that describes the evolutionary history of the total number of genes in the 
genome. 

One can now ask what the constraints the functions g(t) and n c (t) should obey such that the observed 
scaling laws hold. To this end it is convenient to write the dynamics of n c (t) and g(t) in terms of effective 
duplication and deletion rates. That is, we write 

dn (t) 

-J^ = c {t)n c (t) - 5 c (t)n c (t) = Pc (t)n c (t), (6) 

and 

^ = 0(t)g(t) - 5(t)g(t) = p(t)g(t). (7) 

In these equations, (3 c (t) is the (time-dependent) average duplication rates of genes in category c, (3(t) 
is the average duplication rate of all genes, 6 c (t) is the average deletion rate of genes in category c, and 
S(t) is the average deletion rate of all genes. I have also defined the differences of duplication rates 
and deletion rates as p c (t) and p(t). Notice that, since in the above equations p c (t) and p(t) can be 
arbitrary functions of time, any time-dependent function can still be obtained as the solution of the above 
equations. Formally solving these equations we obtain 

n c (t) = n c (0) exp (J^ p c (r)(ir^ = n c (0) exp {(p c )t) , (8) 

and 

g(t) = g(0) exp ( jf p{r)d^j = g(0) exp ((p)t) , (9) 

where I have written the integrals as the time averages of p c and p times the total time t. Note that these 
averages are a function of the evolutionary history of the genome under study. If we now express n c (t) 
in terms of g(t) we obtain 

*W = i (ii&M < >l w/w - do) 

Note that, although this equation may appear to already imply the general scaling relations that were 
found, in fact it is completely general and holds for any set of evolutionary histories n c (t) and g(t). The 
equation is nothing more than a way of rewriting the relation between n c (t) and g(t) in terms of the 
averages (p c ) and (p) of this genome's history. 

The observed scaling relations only follow from (fTOl l if the same equation holds for all genomes. That 
is, if the variables n c (0), g(0) and (p c ) / (p) are the same for all evolutionary histories. This requirement 
is illustrated in figure^] The figure shows the evolutionary histories of a set of genomes in a phylogenetic 
tree. Each leaf of the tree represents one of the bacterial genomes, and at the root is the common ancestor 
genome of all the genomes at the leafs. We thus have a separate equation dTOt for each of the genomes at 
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Figure 7: Example of the evolutionary histories of 5 genomes. Each leaf (colored ball) represents a 
genome. Each white ball corresponds to a common ancestral genome of two or more of the genomes. 
The root of the tree is the common ancestor genome of all 5 genomes. Each colored arrow represents the 
evolutionary history of a genome. 

the leafs. The initial numbers n c (0) are g(0) in each of these equations are just the numbers of genes in 
category c and in the whole genome of the common ancestor, and it is therefore clear that the variables 
n c (0) and g(0) are indeed the same for all the equations 2 . 

Much more interesting is the requirement that the ratios (p c ) / (p) are the same for all evolutionary 
histories. These different evolutionary histories are indicated as colored lines in figure^ The requirement 
that {p c )/{p) be equal for all evolutionary histories thus demands that if one takes the average of the 
duplication minus deletion rate of genes in category c over the entire evolutionary history of a genome, 
and divides this by the average duplication minus deletion rate of all genes in the genome, that this ratio 
always comes out the same, independent of what evolutionary history is averaged over. That is, the ratios 
(p c ) I (p) are universal evolutionary constants and correspond precisely to the exponents a c = (p c ) / (p) 
of the observed scaling laws. 

So what determines (p c ) and (p)? It seems reasonable to assume that the rate at which genes are 
duplicated and deleted is mostly the result of selection. That is, as a first approximation we may assume 
that the rate at which duplications and deletions are introduced during evolution are approximately the 
same for all genes but that different genes have different probabilities of being selectively beneficial 
when duplicated or deleted. The rate p c (t) is then given by some overall rate 7 at which duplications and 
deletions are introduced 3 times the difference between the fraction of genes in the category that 

would benefit the organism when duplicated and the fraction of genes /~ (t) in the category that would 
benefit the organism when deleted: 

pS) = 7(/ c + w - fc~m (id 

2 One caveat is that we implicitly assume that n c (t) > for all t. A slightly more complex treatment is needed for categories 
that were not present in the ancestor, or that disappeared and reappeared during the evolutionary history of some genomes. 

3 In reality the rates at which duplications are introduced is unlikely to equal the rate at which deletions are introduced. We 
ignore this complication for notational simplicity. The theoretical development with this complication would be analogous. 
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We then have 

c ~ (p(t)) "</+-/->■ ( } 

The fractions /+ and /~ of course depend on the selective pressures of the environment. Thus, as 
one follows the genome through its evolutionary history, the demand for genes of a certain function 
fluctuates up and down, and this will be reflected in fluctuations of (/+ — /~). It seems intuitive clear 
that the size of the fluctuations is going to depend strongly on the functional class. That is, one would 
expect that demand for genes that provide an essential and basic function fluctuates relatively little. 
For instance, one would expect that even as the selective environment changes it is rare that existing 
protein biosynthesis genes become dispensable or that duplicates of these genes become desirable. On 
the other hand, the desirability of transcription regulatory genes and signal transducers is going to depend 
crucially on the selective environment in which the organism finds itself. It is thus not implausible that, 
if one averages over sufficiently long evolutionary times that the ratio (/+ — / c r)/(/ + — /~) always 
reaches the same limit, and that this limit is large for highly environment-dependent categories such as 
transcription regulation, and small for environment-insensitive categories such as protein biosynthesis 
and replication. The main open question that remains is the origin of the precise numerical values of the 
ratios (/+ — /~) / (/+ — / _ ) for different functional categories. 



5.1 The exponent for transcription regulatory genes 

The exponent a c for the category of genes involved in transcription regulation is close to 2, with the 99% 
posterior probability interval running from 1.72 to 1.95. Thus, even though the current data suggests that 
the exponent is slightly less than 2, it is tempting to think that the scaling might be simply quadratic. 
This is especially tempting given that this allows one to speculate more easily about the origin of this 
exponent. That is, it is easier to theorize about a quadratic scaling law then about a scaling law with 
exponent 1.83. Some theoretical explanations for the quadratic scaling of transcription regulators have 
recently been put forward 0. In this section I want to discuss these proposals and contrast it with my 
own suggestions in this regard in [ 18 1. 

One of the first things that of course come to mind when attempting to explain a scaling of the form 
n c oc g 2 is that in a genome with g genes, the number of pairs of genes scales precisely as g 2 . That is, 
there are on the order of g(g — l)/2 possible interactions in a genome with g genes. Therefore, if one can 
find any way to argue that the number of transcription regulators should be proportional to the number of 
pairs of genes, this would provide an explanation. In a recent paper [ 5 ] Croft and coworkers put forward 
two models for the observed (approximately) quadratic scaling of the number of transcription regulatory 
genes that both in essence attempt to argue that the number of regulators should be proportional to the 
number of potential interactions between genes, i.e. the number of pairs of genes in the genome. 

Although it is attractive to seek such a simple combinatorial explanation I believe that a simple 
survey of the known regulatory role real transcription factors play shows that such models are in fact 
highly implausible. If each regulator were to somehow 'correspond' to one or more pairwise interactions 
of genes then one would expect that the role of most regulatory genes would be to ensure that certain 
pairs of genes are expressed together or to ensure that certain pairs are not expressed together. 

This is, however, not what is observed. To name just a few known E. coli regulons: the factor 
crp responds to the concentration of cAMP in the cell, and its role is to make the activation of many 
catabolic pathways conditional on the absence of cAMP. The factor lexA activates a number of genes 



13 



that are involved in the repair of DNA damage in response to single stranded DNA. PurR represses a set 
of genes involved in de novo purine biosynthesis and responds to the concentrations of hypoxanthine and 
guanine. FadR senses the presence of long chain fatty acyl-coA compounds and in response regulates 
genes that transport and synthesize fatty acids. Finally, tyrR regulates genes that synthesize and transport 
aromatic amino-acids in response to levels of phenylalanine, tyrosine, and tryptophan. 

In all these cases, the role of the regulator is to sense a particular signal and to respond to this signal 
by activating or repressing a set of genes that implement a specific biological function which is related 
to the signal. In none of these example does it appear that the role of the regulator is in any way related 
to regulating specific interactions between pairs of genes. It thus seems that the number of different 
transcription regulators that the cell needs is much more a reflection of the number of different cellular 
responses to different environments that the cell is capable of. 

In [ 1 8 1 I provided a qualitative argument for the approximately quadratic scaling of the number of 
transcription regulatory genes. According to equation (fl2t the average difference (/+ — /~) between 
the fractions of transcription regulatory genes that would increase fitness when duplicated and those 
that would increase fitness when deleted is almost twice as large as the average difference (/ + — /~) 
between the fractions of all genes that would increase fitness when duplicated and those that would 
increase fitness when deleted. Let us separate the evolutionary history into periods in which the genome 
is growing (/ + > /~) and periods in which it is shrinking (/ + < / _ ), and let us for simplicity assume 
that /~ = /~ = during the former periods and /+ = / + = during the latter periods. We then 
have that during periods of genome growth /+ rj 2f + and that during periods of genome shrinkage 
/~ ~ 2/~. That is, during genome growth transcription regulators are twice as likely to lead to fitness 
increase when duplicated as average genes, and during genome shrinkage transcription regulators are 
twice as likely to lead to fitness increase when deleted. 

I want to suggest that the origin of this factor 2 is the switch-like function of transcription factors. 
Imagine a gene that has just emerged through a duplication. Originally the duplicate will be the same as 
its parent. At that point, the main change caused by this duplication that may affect fitness is a change 
in the dosage of the gene, i.e. from one to two copies. One would expect that the probability for this 
dosage change to have strong deleterious effects is approximately equal for transcription regulators as 
for genes in general. If the duplicated gene is to get fixed in the genome on a longer time scale, then 
a process of mutation and selection should modify the duplicated gene into a gene that increases the 
fitness of the organism. I suggest that the probability for this process to be successful is twice as high 
in transcription regulators as in general genes. In general genes the only avenue for beneficial change 
is a change in the molecular function of the gene, e.g. a metabolic gene may evolve to catalyze a new 
chemical reaction. Transcription regulators, however, may evolve to respond to a signal which the cell 
was previously insensitive to. They may evolve to affect the state of the cell both when this signal is 
present and when the signal is absent. Since this gives twice as many opportunities for a beneficial 
change, one may expect that there is twice as much probability of success. 

A similar argument hold for the rate of deletion. When deleting a non-regulatory gene, the cell simply 
changes to a state without the gene. When a transcription regulator is removed, one effectively removes a 
'switch' from the genome: the cell becomes insensitive to the signal that the regulator responded to. But 
there are again two 'ways' of implementing this sensitivity. The genes regulated in response to the signal 
may be turned constitutively on, or constitutively off. Thus there are two independent ways of removing 
a transcription factor, and one would thus expect the effective rate of transcription factor deletion to be 
twice the rate of deletion of general genes. 
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These arguments are of course highly speculative and may well turn out to be incorrect. However, 
they at least seem consistent with what we know about transcription regulation in bacteria and the process 
of evolution through gene duplication and deletion. Note also that very similar arguments as the ones 
just presented for transcription regulatory genes can be put forward for the category of signal transducing 
genes. These are indeed also observed to scale approximately quadratically with genome size. 

I suspect that it will be impossible to come to any solid conclusions regarding the cause of the 
approximately quadratic scaling for transcription regulators and signal transducers until we have better 
data on the genome-wide topology of the transcription regulatory and signal transduction networks in 
bacteria of differing sizes. 

Finally, I note that most of the functional categories have exponents that do not appear to equal small 
integers or even simple rationals. It is thus clear that simple arguments such as the ones discussed above 
will not be capable of explaining these exponents. What determines the average (/+ — f~)/ (f + — f~) 
for these categories is a fascinating question that at this point is completely open. 

6 Methods 

6.1 The number of genomes as a function of time 

The data for figure [J were obtained by extracting the submission dates of all the microbial genomes 
in genbank at 'ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria 1 from the 'gbs' file for each genome. 
When there were multiple submission dates the earliest date was taken. The logarithm of the num- 
ber of genomes as a function of time was then fitted to a straight line using standard least-square re- 
gression. This least-square fit gives the number of genomes n(t) as a function of time (in years) as 
n(t) = 2('~ 1993 - 4 )/ L38 . The exponential provides a reasonable fit, i.e. r 2 = 0.98 even though the data 
clearly suggest a decrease in the rate at which new genomes appear for the last 1 to 1.5 years. 

6.2 Gathering functional gene-content statistics 

The numbers of genes in different functional categories for each sequenced genome were obtained in the 
following way. Interpro annotations [ 1 1 of fully-sequenced genomes were obtained from the European 
Bioinformatics Institute \2\. Functional categories were taken from the gene ontology biological process 
hierarchy [4]. A mapping from Interpro to GO-categories was also obtained from the gene ontology web- 
site. Using this mapping I gathered, for each GO-category, all Interpro entries that map to it or to one of 
its descendants in the biological process hierarchy. I then counted, for each gene in each fully-sequenced 
genome and each GO-category, the number of Interpro hits h that the gene has to Interpro categories 
associated with the GO-category. A gene with many independent Interpro hits that are associated with 
the same GO-category is of course more likely to be a member of the GO-category than a gene with only 
a single hit. To quantify this, I chose the probability for a gene to be a member of a GO-category to 
which it has h hits to be 1 — exp(— Xh), with h = 3. The results presented in this chapter are largely 
insensitive to changes in h (see the discussion in [18]). 

In this way the number of genes associated with each GO-category in each genome were counted. I 
then selected all GO categories that have a nonzero number of counts in all bacterial genomes. There are 
154 such ubiquitous GO-categories. I also counted the total number of genes that have any annotation at 
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all for each genome. These are defined as genes that have at least one Interpro hit. Further discussion of 
this annotation procedure and its robustness can be found in [ 18 ]. 



6.3 Power-law Fitting 

In order to fit the data to power-law distributions I used a Bayesian procedure that is described in [18]. 
The main advantage of this fitting procedure with respect to simple regression is that the results are 
explicitly rotationally invariant. That is, the best line that the fitting procedure produces doesn't depend 
on the orientation of the coordinate axes with respect to the scatter of data points. 

The result is that the posterior distribution P(ot\D)da for the slope a of the line given the data D is 
given by 

(a 2 + l)( n -V/ 2 da 

where n is the number of genomes in the data, s xx is the variance in x-values (logarithms of the total 
gene numbers), s yy the variance in y-values (logarithms of the number of genes in the category), s yx 
is the co- variance, and C is a normalizing constant. For each of the 154 ubiquitous categories we then 
calculated the 99% posterior probability interval for the slope from equation (fT3l . 



6.4 Quality of the power-law fits 

To calculate the quality of the power-law fits, I first log-transform the data points (rii,gi) to (xi, yi) = 
(log(nj), log(g.j)). The best power-law fit is a straight line y = ax + j3 in the (x, y) plane. For each data 
point (xi,yi) I then find the distance di(l) to this line, and the distance d, (c) to the center of the scatter 
of points. That is, with (x) the average of the x-values, and (y) the average of the y-values, the distance 
di(c) is given by 

di(c) = ^{xi-{x)) 2 + ( yi -{y)y, (14) 
and the distance to the line is given by 



I then define the quality of the fit F as the fraction of the variance that is explained by the fit. 



References 

[1] R. Apweiler, T. K. Attwood, A. Bairoch, A. Bateman, E. Birney, M. Biswas, P. Bucher, L. Cerutti, 
F. Corpet, M. D. Croning, R. Durbin, L. Falquet, W. Fleischmann, J. Gouzy, H. Hermjakob, 
N. Hulo, I. Jonassen, D. Kahn, A. Kanapin, Y. Karavidopoulou, R. Lopez, B. Marx, N. J. Mulder, 
M. Oinn, M. Pagni, F. Servant, C. J. Sigrist, and E. M. Zdobno. The interpro database, an inte- 
grated documentation resource for protein families, domains and functional sites. Nucleic Acids 
Res., 29(1):37^10, 2001. 



16 



[2] R. Apweiler, M. Biswas, W. Fleischmann, A. Kanapin, Y. Karavidopoulou, P. Kersey, E. V. Krivent- 
seva, V. Mittard, N. Mulder, I. Phan, and E. Zdobnov. Proteome analysis database: online applica- 
tion of interpro and clustr for the functional classification of proteins in whole genomes. Nucleic 
Acids Res., 29(l):44-48, 2001. 

[3] J. L. Cherry. Genome size and operon content. /. theor. Biol, 221:401-410, 2003. 

[4] The Gene Ontology Consortium. Gene ontology: tool for the unification of biology. Nat. Genet., 
25:25-29, 2000. 

[5] L. J. Croft, M. J. Lercher, M. J. Gagen, and J. S. Mattick. Is prokaryotic complexity limited by 
accelerated growth in regulatory overhead. Genome Biology, 5:P2, 2003. 

[6] M. Gerstein. A structural census of genomes: comparing bacterial, eukaryotic, and archael genomes 
in terms of protein structure. /. Mol. Biol, 274:562-576, 1997. 

[7] N. Guelzim, S. Bottani, P. Bourgine, and F. Kepes. Topological and causal structure of the yeast 
transcriptional regulatory network. Nat. Genet, 31:60-63, 2002. 

[8] M. Huynen and E. van Nimwegen. The frequency distribution of gene family sizes in complete 
genomes. Mol. Biol. Evol, 15(5):583-589, 1998. 

[9] T. Ito, T. Chiba, R. Ozawa, M. Yoshida, M. Hattori, and Y. Sakahi. A comprehensive two-hybrid 
analysis to explore the yeast protein interactions. Proc. Natl. Acad. Sci. USA, 98:4569-4574, 2001. 

[10] H. Jeong, B. Tombor, R. Albert, Z. N. Oltvai, and A.-L. Barabasi. The large-scale organization of 
metabolic networks. Nature, 407:651-654, 2000. 

[11] I. T. Joliffe. Principle Component Analysis . Springer- Verlag, New York, 1986. 

[12] E. V. Koonin, Y. I. Wolf, and G. P. Karev. The structure of the protein universe and genome 
evolution. Nature, 420:218-222, 2002. 

[13] N. M. Luscombe, J. Qian, Z. Zhang, T. Johnson, and M. Gerstein. The dominance of the population 
by a selected few: power-law behavior applies to a wide variety of genomic properties. Genome 
biology, 3(8), 2002. 

[14] S. Maslov and K. Sneppen. Specificity and stability in topology of protein networks. Science, 
296:910-913, 2002. 

[15] R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, and U. Alon. Network motifs: simple 
building blocks of complex networks. Science, 298:824-827, 2002. 

[16] J. M. Stuart, E. Segal, D. Koller, and S. K. Kim. A gene-coexpression network for global discovery 
of conserved genetic modules. Science, 302:249-255, 2003. 

[17] P. Uetz, L. Giot, G. Cagney, T. Mansfield, R. Judson, J. Knight, and D. Lockshon et al. A compre- 
hensive analysis of protein-protein interactions in saccharomyces cerevisiae. Nature, 403:623-627, 
2000. 



17 



[18] E. van Nimwegen. Scaling laws in the functional content of genomes. Trends in Genet., 19(9):479- 
484, 2003. 

[19] A. Wagner and D. Fell. The small world inside large metabolic networks. Proc. Roy. Soc. London 
Series B, 268:1803-1810, 2001. 



18 



